Skip to content

Reconciliation plan#13748

Open
ndeloof wants to merge 17 commits into
docker:mainfrom
ndeloof:reconciliation_plan
Open

Reconciliation plan#13748
ndeloof wants to merge 17 commits into
docker:mainfrom
ndeloof:reconciliation_plan

Conversation

@ndeloof
Copy link
Copy Markdown
Contributor

@ndeloof ndeloof commented Apr 19, 2026

this is an alternative for #13641
based on previous work, I first generated an analysis an plan document, so AI work on this significant refactoring is better guided and impact is well understood

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 19, 2026

@ndeloof ndeloof force-pushed the reconciliation_plan branch 6 times, most recently from a057fa0 to 46e274e Compare April 20, 2026 07:42
@ndeloof
Copy link
Copy Markdown
Contributor Author

ndeloof commented Apr 20, 2026

/review

@ndeloof ndeloof marked this pull request as ready for review April 20, 2026 08:06
@ndeloof ndeloof requested a review from a team as a code owner April 20, 2026 08:06
@ndeloof ndeloof requested review from Copilot and glours April 20, 2026 08:06
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Introduces a new “reconciliation plan” pipeline for docker compose create by replacing the legacy convergence flow with: observed state collection → plan generation (DAG) → parallel plan execution.

Changes:

  • Added new reconciliation components: ObservedState, Plan, reconcile logic, and a parallel executePlan executor with grouped progress events.
  • Refactored create() to use the new pipeline (including “Running” events for unchanged containers) and removed most of the old convergence implementation.
  • Updated run to resolve service: references without convergence, and added/updated unit tests for the new modules.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
reconciliation.md Architecture/design doc for the new reconciliation engine and pipeline.
pkg/compose/run.go Replaces convergence-based service-reference resolution for run with a new helper.
pkg/compose/reconcile_test.go Adds unit tests validating plan output for networks/volumes/containers/orphans.
pkg/compose/reconcile.go Implements reconciler producing a deterministic DAG plan for infra + containers.
pkg/compose/progress.go Removes runningEvent helper (replaced by direct newEvent usage).
pkg/compose/plan_test.go Adds unit tests for Plan and OperationType stringification.
pkg/compose/plan.go Introduces Plan, PlanNode, Operation, and OperationType abstractions.
pkg/compose/observed_state_test.go Adds tests for observed-state snapshot construction and classification.
pkg/compose/observed_state.go Adds snapshot types + collection logic; adds “Running” events emission helper.
pkg/compose/executor_test.go Adds executor tests using mocked API client calls.
pkg/compose/executor.go Implements parallel plan execution, operation handlers, and grouped progress events.
pkg/compose/create.go Refactors create() to build observed state → reconcile → emit running events → execute plan.
pkg/compose/convergence.go Removes convergence struct/logic; keeps/exports service-reference resolution helpers.
pkg/compose/containers.go Adds getContainersByService() and removes Containers.names().

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/compose/executor.go
Comment thread pkg/compose/executor.go Outdated
Comment thread pkg/compose/reconcile.go
Comment thread pkg/compose/reconcile.go
Comment thread reconciliation.md Outdated
Comment on lines +8 to +10
ensureImagesExists -> ensureNetworks/Volumes -> collectObservedState -> reconcile -> executePlan
(inchange) (inchange) (1) (2) (3)
```
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spelling: "inchange" should be "inchangé".

Copilot uses AI. Check for mistakes.
@glours
Copy link
Copy Markdown
Contributor

glours commented May 21, 2026

/review

@docker-agent
Copy link
Copy Markdown

docker-agent Bot commented May 21, 2026

PR Review Failed — The review agent encountered an error and could not complete the review. View logs.

@glours glours force-pushed the reconciliation_plan branch 2 times, most recently from 8f2d032 to b812190 Compare May 28, 2026 13:58
@glours
Copy link
Copy Markdown
Contributor

glours commented May 28, 2026

/review

@docker-agent
Copy link
Copy Markdown

docker-agent Bot commented May 28, 2026

PR Review Failed — The review agent encountered an error and could not complete the review. View logs.

@ndeloof ndeloof force-pushed the reconciliation_plan branch from b812190 to f418dc4 Compare May 28, 2026 14:29
@glours glours force-pushed the reconciliation_plan branch from f418dc4 to b812190 Compare May 28, 2026 14:31
ndeloof added 6 commits May 28, 2026 16:32
Signed-off-by: Nicolas De Loof <nicolas.deloof@gmail.com>
String() is designed to make it easy to compare coomputed plan vs expectations

Signed-off-by: Nicolas De Loof <nicolas.deloof@gmail.com>
Signed-off-by: Nicolas De Loof <nicolas.deloof@gmail.com>
Signed-off-by: Nicolas De Loof <nicolas.deloof@gmail.com>
Signed-off-by: Nicolas De Loof <nicolas.deloof@gmail.com>
Signed-off-by: Nicolas De Loof <nicolas.deloof@gmail.com>
ndeloof and others added 3 commits May 28, 2026 16:32
Signed-off-by: Nicolas De Loof <nicolas.deloof@gmail.com>
- Remove the `convergence` struct and `newConvergence` constructor
- Extract `resolveServiceReferences` as a standalone function taking
  `map[string]Containers` instead of a method on convergence
- Add `getContainersByService` helper on composeService
- Update run.go and executor.go to use the new standalone function
- Remove dead code: `getObservedState`, `setObservedState`

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Nicolas De Loof <nicolas.deloof@gmail.com>
…remove

- Fix ContainerReplaceLabel detection: use op.Inherited != nil (not
  op.Container) as signal for recreate in execCreateContainer
- Use observed network name (not desired) for DisconnectNetwork and
  RemoveNetwork operations, in case the name changed
- Use observed volume name (not desired) for RemoveVolume operations
- Update reconciliation.md with 3 new lessons learned (7.8, 7.9, 7.10)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Nicolas De Loof <nicolas.deloof@gmail.com>
@ndeloof ndeloof force-pushed the reconciliation_plan branch from b812190 to 08da1d5 Compare May 28, 2026 14:32
@glours glours force-pushed the reconciliation_plan branch from 08da1d5 to aaa34eb Compare May 28, 2026 14:34
glours and others added 5 commits May 28, 2026 16:36
- plan.go: pin OperationType constants to explicit values so adding an op
  in the middle doesn't shift the others.
- executor.go: remove the meaningless `var _ = getContainerProgressName`
  line — same-package functions are always accessible.
- reconcile.go: fix the stale switch-default comment that contradicted
  the case clause above it.
- reconcile.go: drop the local `serviceLabel` const that shadowed
  `api.ServiceLabel` and use the shared constant.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Guillaume Lours <glours@users.noreply.github.com>
ensureProjectVolumes already prompts the user when a volume's config hash
diverges from the compose file (create.go:1626) and recreates it on confirm.
The reconciler ran after ensureProjectVolumes and prompted again with the
exact same message — so a user who declined the first prompt was asked the
same question a second time.

Drop the prompt + recreate call from reconcileVolumes(). Recreation of
diverged volumes stays owned by ensureProjectVolumes; the reconciler only
plans the creation of missing volumes. If the user declined recreation,
the existing container's mounts still match the existing volume name and
hasVolumeMismatch correctly returns false, so containers are not falsely
flagged as obsolete.

Keep the supporting infrastructure available for future use, when
divergence detection migrates fully into the reconciler:

  - reconciler.prompt field
  - prompt parameter on reconcile()
  - planRecreateVolume function (//nolint:unused)
  - servicesUsingVolume function (//nolint:unused)
  - noPrompt test helper

The reframed test (TestReconcileVolumes_DivergedIsIgnored) asserts the
new contract: a diverged volume produces no plan operations from the
reconciler.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Guillaume Lours <glours@users.noreply.github.com>
Two related changes to the executor, plus the small cleanups they
attracted in review:

* Rename node consults a planner-set CreateNodeID instead of walking
  ancestors. The old execRenameContainer searched node.DependsOn for a
  CreateContainer result and fell back to a recursive walk through the
  group chain. That worked only by convention; a future op with cross-
  node data needs would have to rediscover or copy the pattern. Now the
  rename op carries an explicit CreateNodeID int set at plan time and
  the executor reads pctx[CreateNodeID].ContainerID directly. The
  recursive findCreatedIDInChain is gone.

* Stop re-listing containers on every create. execCreateContainer used
  to call getContainersByService(ctx, projectName) — a fresh
  ContainerList per create — to resolve service references at execute
  time. The executor now holds a live containersByService view seeded
  from ObservedState (via observed.containersByService()) and grown as
  OpCreateContainer nodes complete, so service references resolve from
  memory. On OpRemoveContainer the removed container is dropped from
  the view via slices.DeleteFunc, so a dependent's create that resolves
  network_mode: service:x against the just-removed container cannot
  pick up a stale ID (Containers.sorted() orders by canonical name and
  would otherwise return the removed container).

* Defensive slices.Clone of op.Service.VolumesFrom in execCreateContainer.
  resolveServiceReferences mutates VolumesFrom in place, and the
  shallow struct copy of *op.Service still shares the backing array.
  Single-execution-per-node makes it safe today, but the clone removes
  the trap for any future parallel-execution mode.

* Operation gains a CreateNodeID int (not a *PlanNode pointer) to avoid
  a structural cycle between Operation and PlanNode. OperationType
  values are pinned to explicit integers so adding an op in the middle
  cannot silently shift the others.

* execRenameContainer carries two checks so a missing CreateNodeID and
  an empty produced ID are distinguishable in logs. Both are programmer
  invariants (prefixed "internal:").

* containersByServiceFromObserved moved from a package-level helper to
  a method on *ObservedState.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Guillaume Lours <glours@users.noreply.github.com>
Three improvements identified in the Principal Engineer pass but
deliberately deferred:

1. Test fidelity. Split executePlan into newPlanExecutor (constructs
   the executor seeded from observed state) and (*planExecutor).run
   (walks the DAG). Production callers go through executePlan
   unchanged. TestExecutePlanRemoveContainerDropsFromCache now uses
   newPlanExecutor + run, exercising the same errgroup, done-channel
   and group-tracker wiring as production instead of a hand-rolled
   loop over executeNode.

2. //nolint:unused chain. The three preserved helpers
   (reconciler.prompt, planRecreateVolume, servicesUsingVolume) each
   carried a separate "kept for future" comment. Consolidate the
   rationale on the reconciler.prompt field doc and point the helper
   nolint directives there, so a future cleanup is a single grep.

3. Concurrency test. Add TestExecutePlanConcurrentRemovesCacheCoherence
   which builds N independent Stop→Remove chains in one plan; the
   errgroup fans them out across goroutines that all hit
   containersByService under the mutex. Passes under -race. Failure
   would expose a missing or incorrect lock.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Guillaume Lours <glours@users.noreply.github.com>
executor.go had grown to ~480 lines mixing three concerns:

  - DAG orchestration (run, executeNode, planExecutor struct)
  - per-OpType implementations (execCreateNetwork, …, execRenameContainer)
  - event tracking (groupTracker + per-OpType emit helpers)

Split into three files of ~170 lines each, one concern per file:

  - executor.go        — planExecutor, reconciliationContext, run, executeNode dispatch
  - executor_ops.go    — all execXxx methods
  - executor_events.go — groupTracker + emitStartEvent / emitDoneEvent / emitErrorEvent

Pure refactor: no functional change. Tests (incl. -race) and lint both pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Guillaume Lours <glours@users.noreply.github.com>
@glours glours force-pushed the reconciliation_plan branch from aaa34eb to 3900c50 Compare May 28, 2026 14:36
glours and others added 2 commits May 28, 2026 17:05
Three fixes surfaced by an external code review of the new reconciler:

1. Scale-down now propagates through serviceNodes. When a service is
   scaled down (all containers in excess), reconcileService used to
   continue without assigning lastNode, leaving r.serviceNodes[svc]
   unset. Dependent services then declared no edge on the scale-down
   ops and could start before the cleanup finished. Track the
   RemoveContainer node as lastNode so depends_on chains pick it up.

2. mustRecreate errors are no longer silently ignored by sortContainers.
   The comparator used `obsi, _ := r.mustRecreate(...)`, falling back
   to false on any hashing error. Pre-compute obsolescence into a map
   keyed by container ID before sorting and propagate the error to
   reconcileService.

3. A container is no longer Stopped twice when its network and its
   config both diverge. planRecreateNetwork already stops the affected
   container as part of the disconnect/remove/recreate dance; the
   subsequent planRecreateContainer (triggered via hasNetworkMismatch)
   used to add another OpStopContainer against the now-stopped target.
   Track stops in r.stoppedByPlan; planRecreateContainer reuses an
   existing Stop node when present, and chains its Remove on both that
   Stop and the replacement Create.

Two golden tests (TestReconcileNetworks_Diverged*) are updated to
reflect the new, dedupe'd plan shape (one Stop instead of two per
recreated container).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Guillaume Lours <glours@users.noreply.github.com>
Two coverage gaps surfaced by an external review:

- TestReconcileContainers_DependsOnChain: asserts that a service B with
  depends_on: [A] produces a CreateContainer for B that depends on A's
  last plan node (the serviceNodes mechanism in infrastructureDeps).
  This was the only depends_on-via-plan-DAG behavior untested before.

- TestReconcileContainers_DependsOnScaleDown: companion test that
  exercises the scale-down → dependent path specifically, verifying
  that the previous commit's lastNode-on-scale-down fix actually wires
  the dependency through.

- TestOperationTypeString: adds OpRunProvider to the table; all other
  OperationType values were already covered.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Guillaume Lours <glours@users.noreply.github.com>
@glours glours force-pushed the reconciliation_plan branch from f5336ce to 7bbbc77 Compare May 28, 2026 15:05
External reviewer noted that the alreadyStopped branch adds createNode
to removeDeps while the !alreadyStopped branch does not — semantically
correct but fragile, since it relies on the implicit invariant that
stopNode.DependsOn contains createNode in the !alreadyStopped path.

Spell out the invariant in a comment so a future maintainer who edits
the stop → create edge in the normal path knows they must also add
createNode unconditionally in the remove deps.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Guillaume Lours <glours@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants